ABSTRACT

This paper reports our participation in the Federated Web
Search Track at TREC 2014. We submitted 21 runs across all
three tasks: Vertical Selection (7), Resource Selection (7),
and Results Merging (7). Our main purpose is to test several
established resource selection methods on the new, realistic
FedWeb test collections. We evaluated 7 well-known resource
selection methods on the vertical selection and resource
selection tasks. The effectiveness of these methods in the RS
task does not carry over to the VS task, which implies that
more sophisticated algorithms and more diverse sources of
evidence are needed to solve the VS task effectively. Our
Results Merging experiments point to a correlation between
the performance of RM and the performance of its input RS
results.

1.  INTRODUCTION 

Federated Web Search is the task of searching multiple
search engines simultaneously and combining their results in
a coherent way for presentation to the end user. The Federated
Web Search Track 2014 (FedWeb 2014), like its predecessor
FedWeb 2013 [4], features realistic web test collections for
the federated web search task. In addition to the Resource
Selection (RS) and Results Merging (RM) tasks of FedWeb
2013, FedWeb 2014 introduced a new task, the Vertical
Selection (VS) task.

This  is  our  first  participation  in  the  Federated  Web  Search 
track.  In  this  year’s  tasks,  our  main  purpose  is  to  evaluate 
several  established  resource  selection  methods  on  the  new 
Federated  Web  Search  test  collections.  Though  our  focus  is 
on  the  RS  task,  we  also  submitted  runs  for  the  VS  and  RM 
tasks. 


2.  RESOURCE  SELECTION  IN  FEDERATED 
SEARCH 

In a federated search environment, it is generally desirable
to query only a subset of the available resources. This is
often motivated by efficiency, as a selective search strategy
generally means quicker search responses and lower latency.
Moreover, a recent study shows that search effectiveness need
not be reduced when searches are conducted selectively, in
particular when the sources are partitioned or distributed
properly [5]. The goal of RS is then, for a given query, to
select only the most promising search engines from all those
available.

Most existing methods for RS can be categorized into large
document approaches, small document approaches, or
classification-based approaches [6]. In our experiments, we
employ several small document approaches for the Resource
Selection task. Small document approaches rely on a
centralized sample index (CSI) of all the documents sampled
from each source. For a given query, the search results on
the CSI are used to estimate the score of a particular
resource. Small document approaches differ in how they use
these search results. The following methods are used in our
experiments.

2.1  ReDDE 

ReDDE, proposed by Si and Callan, is arguably the most
influential small document approach for resource selection
[11]. For a given query, ReDDE estimates the quality of
resources based on how relevant documents are distributed in
the search results from the CSI. Generally, the top k ranked
documents are assumed to be relevant. Given a sample S and its
source resource R, ReDDE assumes that each document in the
sample represents |R|/|S| documents in the source, where |R|
and |S| are the sizes of R and S respectively. It should be
noted that in the original ReDDE, each document of the sampled
index contributes a fixed score for the source documents it
represents. The score for a given resource is calculated by
counting the number of its documents in the top k search
results and then multiplying by the scaling factor:


$$\mathrm{ReDDE}(R \mid q) = \frac{|R|}{|S|} \cdot \sum_{i=1}^{k} I(d_i \in R) \qquad (1)$$

Later, ReDDE.top [1] was proposed by Arguello to replace the
fixed score with the actual retrieval score of a document in
the search result:


$$\mathrm{ReDDE.top}(R \mid q) = \frac{|R|}{|S|} \cdot \sum_{i=1}^{k} I(d_i \in R)\,\mathrm{RSV}(d_i) \qquad (2)$$


where RSV(d_i) is the retrieval status value of d_i, e.g. P(d_i|q)
when a language model is used as the retrieval model.
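Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the LiDR implementation used in our runs: the function name `redde_scores`, the input layout (a list of `(resource, rsv)` pairs, best first), and the default `k` are assumptions made for the example.

```python
from collections import defaultdict


def redde_scores(ranking, sizes, sample_sizes, k=100, use_rsv=False):
    """Score resources from a CSI ranking.

    ranking      -- list of (resource_id, rsv) pairs, best document first
    sizes        -- dict: resource_id -> estimated resource size |R|
    sample_sizes -- dict: resource_id -> sample size |S|
    use_rsv      -- False: Eq. (1), each top-k document counts 1
                    True:  Eq. (2), each top-k document counts its RSV
    """
    totals = defaultdict(float)
    for resource, rsv in ranking[:k]:
        totals[resource] += rsv if use_rsv else 1.0
    # Scale each sum by |R| / |S|.
    return {r: sizes[r] / sample_sizes[r] * t for r, t in totals.items()}


if __name__ == "__main__":
    ranking = [("A", 0.9), ("B", 0.8), ("A", 0.5), ("C", 0.1)]
    sizes = {"A": 1000, "B": 500, "C": 100}
    samples = {"A": 10, "B": 10, "C": 10}
    print(redde_scores(ranking, sizes, samples, k=3))
    print(redde_scores(ranking, sizes, samples, k=3, use_rsv=True))
```

Resources with no document in the top k are simply absent from the returned dictionary, which corresponds to a score of zero.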
2.2  CRCS

The Central-Rank-based Collection Selection (CRCS) approach
[10], proposed by Shokouhi, uses the rank of each top k
retrieved document to derive its contribution to the
relevance of a resource to the given query. It uses either a
linear or a negative exponential function to convert the
document rank to a score, which is then summed in a manner
similar to ReDDE to determine the score of the resource. This
yields CRCSLinear and CRCSExp as two versions of the CRCS
algorithm.
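The two rank-to-score conversions can be illustrated as follows. This is a hedged sketch rather than the reference implementation: the function names, the 1-based ranks, and the parameter values (`gamma`, `alpha`, `beta`) are assumptions chosen for illustration, and the optional `size_ratio` argument stands in for the resource-size scaling factor.

```python
import math
from collections import defaultdict


def linear_weight(rank, gamma=50):
    """CRCSLinear-style weight: decreases linearly, zero from rank gamma on."""
    return float(gamma - rank) if rank < gamma else 0.0


def exp_weight(rank, alpha=1.2, beta=0.28):
    """CRCSExp-style weight: negative exponential in the rank."""
    return alpha * math.exp(-beta * rank)


def crcs_scores(ranked_resources, weight_fn, size_ratio=None):
    """Sum rank-derived weights per resource.

    ranked_resources -- resource id of each CSI result, best first
    weight_fn        -- maps a 1-based rank to a document weight
    size_ratio       -- optional dict: resource id -> scaling factor
    """
    totals = defaultdict(float)
    for rank, resource in enumerate(ranked_resources, start=1):
        totals[resource] += weight_fn(rank)
    if size_ratio is None:
        return dict(totals)
    return {r: size_ratio[r] * t for r, t in totals.items()}


if __name__ == "__main__":
    ranked = ["A", "B", "A"]
    print(crcs_scores(ranked, lambda r: linear_weight(r, gamma=5)))
    print(crcs_scores(ranked, exp_weight))
```

Because the exponential weight decays quickly, CRCSExp concentrates almost all of a resource's score on its few highest-ranked sampled documents, whereas CRCSLinear spreads the credit over the whole top-gamma window.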

2.3  SUSHI  and  CiSS 

In contrast to ReDDE and CRCS, which use only the rank
information of sampled documents, SUSHI [12] and CiSS [9] use
the actual relevance scores of the sampled documents to derive
the relevance of the sources. SUSHI fits the scores of the
documents from a particular resource to a smooth curve, and
ranks resources by maximizing a chosen metric, e.g. P@10.
SUSHI intentionally selects fewer resources than the ReDDE
and CRCS methods. To score a resource, CiSS gathers the
documents belonging to that resource in the search result
list and generates a new ranking of them based on their
relative order. The document scores and their new ranks are
then transformed using an exponential function and a
logarithmic function respectively. A linear function is
fitted to the documents in the space where the
log-transformed ranks form the x-axis and the exponentially
transformed document scores form the y-axis. The resource
score is then an integral over this curve.



4.  EXPERIMENTS  AND  RESULTS 
4.1  Resource  Selection 

The purpose of the RS task is to predict the quality of
individual resources for given topics. All the resources must
be ranked for a given search topic, with more relevant
resources ranked higher. Our RS procedure used the following
seven RS methods: ReDDE, ReDDE.top, CRCSLinear, CRCSExp, CiSS,
CiSSApprox, and SUSHI. All of these small document RS
approaches have reference implementations in the LiDR
library² by Ilya Markov [6]. Many of these algorithms require
the size of the resource in order to approximate the complete
ranking from the sampled search results. In our case, the
sizes of most of the search engines involved are not
available. We therefore made a strong simplifying assumption:
we set the ratio of resource size to sample size to the same
constant for all resources, so that it does not affect the
ranking of resources.
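To make the effect of this assumption explicit for ReDDE, the following short derivation (written here as an illustration; the constant is denoted c, a symbol introduced only for this note) shows that a constant size ratio reduces Equation (1) to a count of top-k documents:

```latex
% Assumption: |R| / |S| = c > 0 for every resource R.
\mathrm{ReDDE}(R \mid q)
  = \frac{|R|}{|S|} \sum_{i=1}^{k} I(d_i \in R)
  = c \sum_{i=1}^{k} I(d_i \in R)

% Hence, for any two resources R_a and R_b:
\mathrm{ReDDE}(R_a \mid q) \ge \mathrm{ReDDE}(R_b \mid q)
  \iff
  \sum_{i=1}^{k} I(d_i \in R_a) \ge \sum_{i=1}^{k} I(d_i \in R_b)
```

The resource ranking is then determined entirely by the evidence in the sampled search results, and the value of c itself is irrelevant.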

With the 3 retrieval setups detailed in Section 3 and the 7
RS methods introduced in Section 2, there are 21 RS run
settings in total. We first ran all our settings on the
FedWeb13 collection, and then chose the top 7 run settings
for our FedWeb14 submissions.

Table  1  shows  our  submitted  results: 


3.  DATASET  AND  RETRIEVAL  SETUP 

The FedWeb14 test collection, created by the University of
Twente group, is used in this year's Federated Web Search
track [4, 8]. It consists of snippets and documents sampled
from the search result pages of 149 search engines. 4000
queries were used in building the sample set. As part of the
Vertical Selection task, the search engines are categorized
into 24 verticals, such as General, Video, Jobs, Academic,
and so on. Each search engine belongs to only one vertical.
Previous federated web search experiments were generally run
on collections customized by reusing existing IR test
collections. The FedWeb13 and FedWeb14 test collections are
crawled directly from different vertical search engines,
making them more realistic. To the best of our knowledge, no
work has been done to test established resource selection
methods on them.

We created a centralized sample index (CSI) of all the
sampled documents. Our index is built with the Indri Toolkit¹,
using the Krovetz stemmer and without removing any stop words.

For both the VS and RS tasks, the inputs are generated
through the following procedure: retrieve the top 1000
documents from the CSI for each topic. For the retrieval, we
used two kinds of retrieval models and two kinds of query
models. Of the retrieval models, one is the BM25 retrieval
model with k1 = 1.2 and b = 0.75; the other is a language
model with Dirichlet smoothing and μ = 1350, which is about
the average document length in the CSI. Of the query models,
one uses the plain query terms (PlainQ); the other uses the
Markov Random Field model's sequential dependence query model
(MRF-SD-Q) [7]. This yields the following three sets of
retrieval results for the topics: plain query terms with the
language model (LM+PlainQ), plain query terms with the Okapi
BM25 retrieval model (BM25+PlainQ), and the MRF sequential
dependence query model with the language model (LM+MRF-SD-Q).
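As an illustration of the sequential dependence query model, the sketch below expands a list of query terms into an Indri structured query built from the `#weight`, `#combine`, `#1` (exact ordered phrase) and `#uwN` (unordered window) operators. The helper name `sd_query`, the weights 0.85/0.10/0.05 and the window width 8 are assumptions commonly used with this model; they are not settings reported for our runs.

```python
def sd_query(terms, w_t=0.85, w_o=0.10, w_u=0.05, window=8):
    """Build an Indri sequential-dependence query string.

    terms  -- list of query terms, in query order
    w_t    -- weight of the unigram (single term) component
    w_o    -- weight of the ordered adjacent-pair component
    w_u    -- weight of the unordered-window pair component
    window -- width of the unordered window
    """
    if len(terms) < 2:
        # No adjacent pairs exist, so fall back to a plain query.
        return "#combine( " + " ".join(terms) + " )"
    pairs = list(zip(terms, terms[1:]))
    unigrams = " ".join(terms)
    ordered = " ".join(f"#1( {a} {b} )" for a, b in pairs)
    unordered = " ".join(f"#uw{window}( {a} {b} )" for a, b in pairs)
    return (
        f"#weight( {w_t} #combine( {unigrams} ) "
        f"{w_o} #combine( {ordered} ) "
        f"{w_u} #combine( {unordered} ) )"
    )


if __name__ == "__main__":
    print(sd_query(["barack", "obama"]))
    print(sd_query(["federated", "web", "search"]))
```

The PlainQ setting corresponds to the single-term fallback branch applied to all queries, i.e. a plain `#combine` over the query terms.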


runID      nDCG@20  nDCG@10  nP@1   nP@5
drexelRS1  0.389    0.348    0.222  0.318
drexelRS2  0.328    0.227    0.125  0.180
drexelRS3  0.333    0.229    0.125  0.179
drexelRS4  0.333    0.229    0.125  0.180
drexelRS5  0.342    0.241    0.135  0.211
drexelRS6  0.382    0.284    0.201  0.250
drexelRS7  0.422    0.359    0.293  0.314

Table 1: Resource Selection Results

runID      Run setting
drexelRS1  LM+PlainQ+CRCSExp
drexelRS2  LM+PlainQ+ReDDE
drexelRS3  LM+PlainQ+CiSSApprox
drexelRS4  LM+PlainQ+CiSS
drexelRS5  BM25+PlainQ+CRCSLinear
drexelRS6  LM+MRF-SD-Q+ReDDETop
drexelRS7  LM+MRF-SD-Q+SUSHI

nP@1 and nP@5 are the normalized graded precision measures
introduced in [4].

Based on our submitted results, SUSHI with the language
model and sequential dependence queries performs best among
all the submitted settings in terms of nDCG@20, nDCG@10 and
nP@1. CRCSExp with the language model and plain queries
performs best in terms of nP@5.

A query-by-query comparison between the two best-performing
runs, drexelRS7 and drexelRS1, shows that even though
drexelRS7 outperforms drexelRS1 in overall nDCG@20, each of
the two outperforms the other on half of the topics
(Figure 1).

With  the  released  RS  qrels  data,  we  analyzed  all  our  21 
runs  and  report  nDCG@20  and  nDCG@10  for  all  the  21 


¹ http://www.lemurproject.org/indri.php

² https://github.com/markovi/LiDR
Figure 1: nDCG@20 differences between drexelRS7 and
drexelRS1 across topics (x-axis: Topics); positive bars
indicate that drexelRS7 works better for that topic, and
negative bars that it works worse.


runs in Table 2. Of the three retrieval settings, SUSHI
performs best in two, and CiSSApprox performs best in the
remaining one, the BM25 retrieval model setting. The
performance of the CRCS-related methods is more robust across
different setups than that of the others, which is consistent
with earlier findings in [13].


runID                   nDCG@20  nDCG@10
mrfsd-lm-CRCSExp        0.3911   0.3450
mrfsd-lm-CRCSLinear     0.3618   0.2492
mrfsd-lm-CiSS           0.3487   0.2287
mrfsd-lm-CiSSApprox     0.3496   0.2289
mrfsd-lm-ReDDE          0.3464   0.2287
mrfsd-lm-ReDDETop       0.3821   0.2844
mrfsd-lm-SUSHI          0.4224   0.3591
plain-lm-CRCSExp        0.3889   0.3477
plain-lm-CRCSLinear     0.3498   0.2406
plain-lm-CiSS           0.3325   0.2289
plain-lm-CiSSApprox     0.3325   0.2288
plain-lm-ReDDE          0.3276   0.2268
plain-lm-ReDDETop       0.3452   0.2424
plain-lm-SUSHI          0.4047   0.3163
plain-bm25-CRCSExp      0.3796   0.3238
plain-bm25-CRCSLinear   0.3423   0.2414
plain-bm25-CiSS         0.3858   0.2927
plain-bm25-CiSSApprox   0.4095   0.3153
plain-bm25-ReDDE        0.3405   0.2307
plain-bm25-ReDDETop     0.3479   0.2349
plain-bm25-SUSHI        0.3336   0.2422

Table 2: Performance of all 21 RS runs

A more in-depth study is needed to separate the
contributions of the different factors, i.e. query model,
retrieval model, and RS algorithm, to the differences in the
IR metrics.

4.2  Vertical  Selection 

In web search, verticals can be defined by topic, e.g.
weather, sports, etc., by media type, e.g. image, video,
etc., or by genre of content, e.g. news, blogs, encyclopedia,
etc. The user's query may carry a strong indication of
vertical intent, e.g. "arrow icon", which is clearly oriented
to the image vertical, or it may be intrinsically ambiguous,
e.g. "Barack Obama", which may be associated with verticals
such as encyclopedia, news, general web and so on. In the
latter scenario, presenting search results from multiple
relevant verticals is desirable and would improve users'
satisfaction with the search service.

The task of vertical selection is to predict and rank the
verticals for a given query. The relevance of a vertical to a
query can be interpreted in two senses. First, the vertical
is aligned overall with the user's search intent. Second, the
vertical has many relevant documents for the user's query.
Zhou et al. recently showed empirically that the two
correlate well with each other. Therefore, the ground truth
sets of relevant verticals can be determined based on the
vertical collection relevance [14]. The sources of evidence
for vertical selection may include the query string,
vertical-representative corpora, the query log associated
with the vertical, and so on [2].

In this year's work, we approach the VS task in the same
way as the RS task. Each vertical is treated as a single
resource: all the returned results that belong to the
resources of a particular vertical are treated as coming from
the same source. The general resource selection procedures
are then applied to these verticals. Because only a subset of
verticals should be returned in the vertical selection task,
we applied a threshold to select only the top verticals.
Using the normalized scores of the verticals for each query,
we select only those verticals whose discounted gain exceeds
a cutoff threshold. In the submitted runs, this threshold
value is set to 0.01. Table 3 shows the performance of our
submitted runs.
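The procedure can be sketched as below. This is only one possible reading, stated under explicit assumptions: the resource scorer is a plain count of top-ranked documents, scores are normalized by their sum, and the "discounted gain" of a vertical is taken to be its normalized score divided by log2(rank + 1). The function names and these choices are illustrative and are not the exact formulas of the submitted runs.

```python
import math
from collections import Counter


def count_scores(ranked_sources):
    """Count-based scorer (ReDDE with a constant size ratio)."""
    return dict(Counter(ranked_sources))


def select_verticals(ranked_resources, resource_to_vertical,
                     score_fn=count_scores, threshold=0.01):
    """Treat each vertical as one resource, score it, and apply a cutoff.

    ranked_resources     -- resource id of each CSI result, best first
    resource_to_vertical -- dict: resource id -> vertical name
    score_fn             -- any scorer over a ranked list of source ids
    threshold            -- minimum (assumed) discounted gain to be kept
    """
    # Relabel every result with the vertical of its resource.
    ranked_verticals = [resource_to_vertical[r] for r in ranked_resources]
    scores = score_fn(ranked_verticals)
    total = sum(scores.values())
    if total <= 0:
        return []
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    for rank, (vertical, score) in enumerate(ordered, start=1):
        gain = (score / total) / math.log2(rank + 1)  # assumed discount
        if gain > threshold:
            selected.append(vertical)
    return selected


if __name__ == "__main__":
    results = ["e1", "e2", "e3", "e1", "e2", "e1"]
    mapping = {"e1": "General", "e2": "General", "e3": "Video"}
    print(select_verticals(results, mapping, threshold=0.01))
    print(select_verticals(results, mapping, threshold=0.2))
```

With a cutoff as low as 0.01, almost every vertical that contributes any document to the ranking is kept, which favors recall over precision.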


runID      Precision  Recall  F1
drexelVS1  0.240      0.506   0.284
drexelVS2  0.159      0.824   0.233
drexelVS3  0.134      0.960   0.212
drexelVS4  0.134      0.960   0.212
drexelVS5  0.163      0.824   0.244
drexelVS6  0.171      0.729   0.251
drexelVS7  0.189      0.732   0.271

Table 3: Vertical Selection Results

runID      Run setting
drexelVS1  LM+PlainQ+CRCSExp
drexelVS2  LM+PlainQ+ReDDE
drexelVS3  LM+PlainQ+CiSSApprox
drexelVS4  LM+PlainQ+CiSS
drexelVS5  BM25+PlainQ+CRCSLinear
drexelVS6  LM+MRF-SD-Q+ReDDETop
drexelVS7  LM+MRF-SD-Q+SUSHI


CRCSExp with the language model and plain queries achieved
the highest precision and F1 scores. Overall, our runs rank
around the median of the submissions, perhaps due to their
relatively low precision. With the release of the qrels for
VS, we investigated whether increasing the cut-off threshold
for VS would increase the F1 score. Figure 2 shows the
results of sweeping the threshold value from 0.01 to 0.5.
Some algorithms, such as CiSS and CRCSLinear, show an
increase in F1 at some point, while many others do not. Our
experiments indicate that naively treating the vertical
selection task as a traditional resource selection task is
not very effective.


Figure 2: Change of F1 as threshold is changed from 0.01 to
0.51 (x-axis: Threshold; one curve per run: drexelVS1
(plain-lm-CRCSExp), drexelVS2 (plain-lm-ReDDE), drexelVS3
(plain-lm-CiSSApprox), drexelVS4 (plain-lm-CiSS), drexelVS5
(plain-bm25-CRCSLinear), drexelVS6 (mrfsd-lm-ReDDETop),
drexelVS7 (mrfsd-lm-SUSHI)).

4.3  Results  Merging 

The  Results  Merging  (RM)  task  is  to  merge  search  result 
snippets  from  resources  selected  at  the  RS  stage  into  a  single 
rank  ordered  list.  The  track  organizer  provides  topic  search 
snippets  from  the  149  search  engines  for  75  topics.  Therefore 
for  each  topic,  there  are  149  sets  of  snippets  organized  based 
on  the  resources,  and  for  each  resource  there  are  75  sets  of 
snippets  organized  based  on  the  topics.  During  the  merging 
stage,  only  the  top  20  resources  can  be  selected  as  the  sources 
of  snippets  to  be  merged.  A  baseline  RS  result  is  provided 
by  the  organizer  and  required  to  be  the  input  of  at  least  one 
submitted  RM  run. 

There are mainly two kinds of approaches to results
merging: score-based and rank-based approaches. Previous
research shows that rank-based approaches such as Reciprocal
Rank Fusion (RRF) [3] generally outperform score-based
approaches. In our case, no score information is provided for
the snippets, so a rank-based approach is the natural choice.

Our  solution  to  the  result  merging  task  is  to  leverage  the 
reciprocal  rank  (RR)  of  a  document  as  the  basic  retrieval 
status  value  (RSV)  for  a  given  snippet.  For  a  given  query  q, 
the  RR  of  a  document  d  from  the  results  of  a  resource  Ri  is 
given  by: 

$$\mathrm{RR}(d \mid q, R_i) = \frac{1}{k + r(d)} \qquad (3)$$

where  r(d)  is  d’s  rank  in  the  result  list,  and  k  is  generally 
set  to  60. 

This score is further weighted by either the score or the
reciprocal rank of the selected resource. The document score
weighted by the selected resource's score is:

$$\mathrm{Score}(d \mid q, R_i) = \mathrm{RS}(R_i \mid q) \times \mathrm{RR}(d \mid q, R_i) \qquad (4)$$

where RS(R_i|q) is the score of resource R_i from the RS
stage. The document score weighted by the selected resource's
reciprocal rank is:

$$\mathrm{Score}_{rank}(d \mid q, R_i) = \frac{c}{\mathrm{RS}_{rank}(R_i \mid q)} \times \mathrm{RR}(d \mid q, R_i) \qquad (5)$$

where RS_rank(R_i|q) is the rank of resource R_i from the RS
stage, and c is a constant.

The above score is used to produce the final merged document
ranking for a given query. Note that we did not handle
duplicate documents in the submitted runs.
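The merging step can be sketched as follows. It is a minimal illustration under stated assumptions: the function name `merge_results`, the input layout, and the default c = 1.0 are hypothetical, while k = 60 and the limit of 20 resources follow the text above. Like the submitted runs, the sketch does not remove duplicates.

```python
def merge_results(result_lists, resource_scores, k=60,
                  by_rank=False, c=1.0, top_n=20):
    """Merge per-resource result lists into one ranking.

    result_lists    -- dict: resource id -> list of snippet ids, best first
    resource_scores -- dict: resource id -> score from the RS stage
    by_rank         -- False: weight by the resource score
                       True:  weight by c / (rank of the resource)
    top_n           -- number of top resources whose snippets are merged
    """
    # Keep only the top_n resources according to the RS stage.
    ordered = sorted(resource_scores, key=resource_scores.get, reverse=True)
    merged = []
    for res_rank, resource in enumerate(ordered[:top_n], start=1):
        weight = c / res_rank if by_rank else resource_scores[resource]
        docs = result_lists.get(resource, [])
        for doc_rank, doc in enumerate(docs, start=1):
            rr = 1.0 / (k + doc_rank)  # reciprocal-rank value of the snippet
            merged.append((weight * rr, doc))
    merged.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in merged]


if __name__ == "__main__":
    lists = {"A": ["a1", "a2", "a3", "a4"], "B": ["b1"]}
    rs = {"A": 0.52, "B": 0.50}
    print(merge_results(lists, rs))                # weighted by RS score
    print(merge_results(lists, rs, by_rank=True))  # weighted by RS rank
```

In this small example the two schemes differ: with nearly equal resource scores, score weighting interleaves the first snippet of B among the snippets of A, whereas rank weighting places every snippet of the top resource first.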

In addition to the runs based on the baseline resource list
from the organizer, we submitted 5 runs based on our own
resource selection results. The final results are shown in
Table 4; the runID prefix indicates the corresponding
resource selection run, and the trailing W or R indicates
whether the run is based on the resource score (W) or the
resource reciprocal rank (R).

Our results show that the baseline resource list outperforms
our RS results. Using the qrels of the RS task, we find that
the nDCG@20 and nDCG@10 of the baseline RS run are 0.428 and
0.372, respectively. For our best RS run, drexelRS7, the
nDCG@20 and nDCG@10 are 0.422 and 0.359, which are rather
close to those of the baseline RS run. The nDCG@20 and
nDCG@10 of their corresponding RM runs, FW14basemW and
drexelRS7mW, are also very close. It is therefore likely that
the performance of RM correlates with the performance of RS
in our current methodology. More thorough analysis needs to
be done to confirm this conjecture.

Of the two weighting schemes, based on the selected
resource's score or on its reciprocal rank, the latter
generally performs better than the former.

5.  CONCLUSION  AND  FUTURE  WORK 

We described here the 21 runs we submitted to the Federated
Web Search track at TREC 2014. We evaluated 7 well-known
resource selection methods on the vertical selection and
resource selection tasks. The effectiveness of these methods
in the RS task does not carry over to the VS task, which
implies that more sophisticated algorithms and more diverse
sources of evidence are needed to solve the VS task
effectively. Our Results Merging experiments point to a
correlation between the performance of RM and the performance
of its input RS results.

A more in-depth and comprehensive analysis and comparison
of all the runs, including submitted, unsubmitted and
post-mortem runs, is planned on the realistic and valuable
FedWeb13 and FedWeb14 test collections.